Task introduction

Text-to-video retrieval aims to automatically locate the video segments most semantically relevant to a given natural language description from a large-scale video database.
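Retrieval quality on benchmarks like those below is commonly reported as Recall@K: the fraction of text queries whose matching video appears among the top-K ranked results. A minimal sketch, assuming a precomputed text-video similarity matrix in which row i / column i form the true matching pair (the matrix and values here are toy data, not from any real model):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of text queries whose true video (the diagonal entry)
    ranks in the top-k by similarity. sim[i, j] is the similarity
    between text query i and video j."""
    ranks = (-sim).argsort(axis=1)  # videos sorted best-first per query
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy 3x3 similarity matrix (rows: text queries, cols: videos).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.4]])
print(recall_at_k(sim, 1))  # only query 0 ranks its video first -> 1/3
```

Papers on MSR-VTT typically report R@1, R@5, and R@10 over the test split.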

Evaluation Dataset

MSR-VTT

Data description:

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset of videos paired with text annotations. It consists of 10,000 video clips drawn from 20 categories, and each clip is annotated with 20 English sentences.

Dataset structure:

Amount of source data:

The dataset is split into training (6,513 videos), validation (497 videos), and test (2,990 videos) sets; each video has 20 captions.

Data detail:

KEYS     EXPLAIN
vid      video
texts    captions of the video

Sample of source dataset:

vid:
(video frame omitted)
texts:

  1. a baker is demonstrating a cooking technique
  2. a female giving a baking demonstration in her kitchen
  3. a girl explaining to prepare a dish
  4. a lady with a scarf is cooking with dough
  5. a person is preparing some food
  6. a person making pastries
  7. a woman is making a pastry
  8. a woman is rolling doe
  9. a woman is rolling dough around a stick
  10. a woman is rolling dough
  11. a woman is rolling dough
  12. a woman is wrapping dough around some food item
  13. a woman rolling up pastry while giving instructions
  14. a woman rolls dough
  15. a woman showing an easy way to make crescent rolls
  16. how to prepare food rolls
  17. the pastry should have five creases
  18. a person is preparing some food
  19. a woman is rolling dough around a stick
  20. a woman rolls dough
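Under the "vid"/"texts" keys described above, one record might be represented in memory as follows. This is a hypothetical sketch: the video identifier and the exact container format are made up here, not taken from any official loader.

```python
# Hypothetical in-memory form of one MSR-VTT record; field names follow
# the "vid"/"texts" keys documented above. The id "video1234" is invented.
sample = {
    "vid": "video1234",
    "texts": [
        "a baker is demonstrating a cooking technique",
        "a woman is rolling dough around a stick",
        # ... 20 captions in total per clip
    ],
}
print(len(sample["texts"]))  # 2 captions shown of the 20 per clip
```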

Citation information:

@inproceedings{xu2016msr-vtt,
  author    = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
  title     = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2016},
}

UCF-101

Data description:

UCF101 is a video dataset with 101 action categories, collected from YouTube by the University of Central Florida, containing 13,320 videos in total.

Dataset structure:

Amount of source data:

The dataset is split into training (9,537 videos) and test (3,783 videos) sets.

Data detail:

KEYS     EXPLAIN
vid      video
label    the label of the video

Sample of source dataset:

vid:
(video frame omitted)

label:
Playing Basketball
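Because UCF-101 carries class labels rather than free-form captions, a common way to use it for text-to-video retrieval is to turn each label into a text query via a prompt template. The template below is a hypothetical illustration, not a prescribed format:

```python
# Hypothetical prompt template turning a UCF-101 class label into a
# natural-language query for text-to-video retrieval.
def label_to_query(label):
    return f"a video of a person {label.lower()}"

print(label_to_query("Playing Basketball"))
# -> "a video of a person playing basketball"
```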

Citation information:

@article{soomro2012ucf101,
  title={UCF101: A dataset of 101 human actions classes from videos in the wild},
  author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal={arXiv preprint arXiv:1212.0402},
  year={2012}
}